This is all very good reasoning, and you’re not the first to think of it, but for reasons that are obvious if you think some more about it, it’s generally considered a REALLY bad idea to talk about it in public. PMing or emailing official SIAI people should get you a link to safer avenues for discussing these kinds of basilisks.
it’s generally considered a REALLY bad idea to talk about it in public
Well, that’s kind of what the post is against. I think the outlook is quite positive, really; although there are a few arguments for general bad stuff (and really, when isn’t there a 10^-15 chance of horrible stuff), when it comes to actual decisions it’s irrational to be blackmailed in this way.
It would be more sensible to check with other people, rather than assuming it’s safe, before exposing the public to something that you know that a lot of people believe to be dangerous.
...before exposing the public to something that you know that a lot of people believe to be dangerous.
The pieces of the puzzle that Manfred put together can all be found on lesswrong. What do you suggest, that research into game and decision theory be banned?
You’re being facetious. No one is seriously disputing where the boundary between basilisk and non-basilisk lies, only what to do with the things on the basilisk side of the line.
No one is seriously disputing where the boundary between basilisk and non-basilisk lies...
This assumes that everyone knows where the boundary lies. The original post by Manfred either crossed the boundary or it didn’t. In the case that it didn’t, it only serves as a warning sign of where not to go. In the case that it did, how is your knowledge of the boundary not a case of hindsight bias?
PMing or emailing official SIAI people should get you a link to safer avenues for discussing these kinds of basilisks.
Hmm, should I vote you up because what you’re saying is true, or should I vote you down because you are attracting attention to the parent post, which is harmful to think about?
If an idea is guessable, then it seems irrational to think it is harmful to communicate it to somebody, since they could have guessed it themselves. Given that this is a website about rationality, IMO we should be able to talk about the chain of reasoning that leads to the decision that this guessable idea is harmful to communicate, since there’s clearly a flaw in there somewhere.
Upvoted the parent because I think the harm here is imaginary. Absurdly large utilities do not describe non-absurdly-large brains, but they are not a surprising output from humans displaying fitness. (Hey, I know a large number! Look at me!)
These ideas have come up and were suppressed before, so this is not a specific criticism of the original post.
The solution to this problem is very simple. So, you have imagined a possible AI which will do terrible things unless you accede to its wishes? Just remember all the other equally possible AIs which will do (X) unless you do (Y), for all possible values of X and Y.
Knowing that the decision will be generated by an AI which will be built by humans (rather than, say, by rolling lots of dice), thoroughly upsets the symmetry between one choice and its alternatives. For example, if hacking a certain computer will best expand an AI’s available resources, the AI will send a very imbalanced selection of messages to that computer, even though there are many possible messages it could send. Your reminder that all “will do (X) unless you do (Y)” constructions are possible is similar to the reminder that all strings of bits are possible messages to the target computer. You should still expect either nothing, or a virus.
The solution is more like realizing that “you have imagined a possible AI which will do terrible things unless you accede to its wishes” is not actually the correct description of the problem—instead, your choice is like forming an unbreakable rule for how you’ll respond to kidnappings, even before the kidnapper is born, let alone contemplates a crime.
This “horrible strategy” is basically built on a kind of farcical newcomb’s problem where both boxes are transparent, you’re the one filling them with money, and omega is the one picking. It doesn’t work at all.
If people would stop cranking their hypothetical scenarios up to 11 in the interest of “seriousness”, maybe they could stop hyperventilating long enough to notice these things...
This “horrible strategy” is basically built on a kind of farcical newcomb’s problem where both boxes are transparent, you’re the one filling them with money, and omega is the one picking.
What it’s really like is a payoff table where the AI options are “horrible” and “not,” and the human options are “give in” and “don’t.” The payoffs look something like this, where the numbers are AI payoff, human payoff:

                 H (horrible)   N (not horrible)
G (give in)        +10, -1         +11, -1
D (don’t)          -1, -10           0, 0
The AI has to spend resources if it wants to be horrible, and we’re assuming that the situation is as usually described, before talking about things like (2) (3) or (4). Note the Nash equilibrium at “not horrible, don’t give in” - if the AI moves left it’s worse for it, and if the humans move up it’s worse for them. So everything should be fine, right? And yet some people still give in to hostage takers in the real world, a similar situation. Once the hostage has already been taken, i.e. H is chosen, your typical locally-reasoning entity will give in, thus making hostages useful. The way to keep options open and keep things at the Nash equilibrium is to not reason locally—to choose winning strategies rather than just reasoning locally. There’s a nice symmetry to this, since the only way a future AI could think to bargain with you this way is by choosing winning strategies, not just local moves. When you both do it, you go back to the Nash equilibrium.
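For concreteness, here is a minimal Python sketch of the game above (the payoffs are the illustrative numbers from the table, not anything derived from a real scenario). It checks every cell for a pure-strategy Nash equilibrium by testing unilateral deviations, and finds only “don’t give in, not horrible”:

```python
# Minimal sketch of the 2x2 game above, using the illustrative payoffs
# (AI payoff, human payoff). Rows: the human gives in (G) or doesn't (D);
# columns: the AI is horrible (H) or not (N).
payoffs = {
    ("G", "H"): (10, -1),
    ("G", "N"): (11, -1),
    ("D", "H"): (-1, -10),
    ("D", "N"): (0, 0),
}

def is_pure_nash(human_move, ai_move):
    """A cell is a pure-strategy Nash equilibrium if neither side gains by deviating alone."""
    ai_payoff, human_payoff = payoffs[(human_move, ai_move)]
    if any(payoffs[(human_move, alt)][0] > ai_payoff for alt in ("H", "N")):
        return False   # the AI would rather switch columns
    if any(payoffs[(alt, ai_move)][1] > human_payoff for alt in ("G", "D")):
        return False   # the human would rather switch rows
    return True

print([cell for cell in payoffs if is_pure_nash(*cell)])  # [('D', 'N')]: don't give in, not horrible
```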
The only way you can “bargain” with somebody from the past is if you can be PRESENT in the past in a simulation they’re running. That’s how Newcomb’s problem works: Omega is simulating you at time 0, and via that simulation, you have two-way interaction.
In this scenario, YOU have to be simulating THE AI at time 0, in your human imagination. This is not possible. The fact that the payoff matrix is simple does not make the opponent’s reasoning simple, and in order for “bargaining” to happen, their reasoning has to be not only simple, but, crucially, tied to your actions. The situation is, exactly as I described, Newcomb’s problem with you preparing the boxes and Omega picking.
YOU have to be simulating THE AI at time 0, in your human imagination. This is not possible.
Entirely possible, since I only need to prove a fact about their decision theory, not simulate it in real time—though it may mean that it’s a smaller subset of possible AIs. But if we allow unbounded utility, any finite probability is enough to blackmail with.
As for this being like Newcomb’s problem—no it’s not, the payoff matrix is different.
EDIT: Well, I guess it is sort of similar. But “sort of similar” isn’t enough, it really is a different game.
You’re right about the payoff matrix, I guess newcomb’s problem doesn’t have a payoff matrix at all, since there’s no payoff defined for the person filling the boxes.
What do you mean by “prove a fact about their decision theory”? Do you mean that you’re proving “a rational AI would use decision theory X and therefore use strategy Y”, or do you only mean “GIVEN that an AI uses decision theory X, they would use strategy Y”?
There seems to be a belief floating around this site that an AI could end up using any old kind of decision theory, depending on how it was programmed. Do you subscribe to this?
The “horrible strategy”, Newcomb’s problem, and TDT games in general only make sense if player 2 (the player who acts second) can be simulated by player 1 in enough detail that player 1 can make a choice which is CONDITIONAL on player 2’s choice. A choice which, to reiterate, they have not yet actually made, and which they have an incentive to make differently than they are expected to.
The difficulty of achieving this may best be illustrated with an example. Here player 1, a human, and player 2, omega, are playing newcomb’s problem.
Player 1: “Omega is super-intelligent and super-rational. He uses updateless decision theory. Therefore, he will pick the winning strategy rather than the winning local move. Ergo, he will only take the opaque box. So, I should put money in both boxes. A = A.”
Player 2: “Chump.” [takes both boxes]
But of course, this wouldn’t REALLY happen, because if omega reasoned locally like this, we’d somehow be able to predict that, right? and so we wouldn’t put money in the box, right? And so, being rational, he wouldn’t want to act like that, because then he’d get less money. So, he’d definitely one-box. Whew, glad we reasoned through that. Let’s put the money in both boxes now.
Player 2: “Chump.” [takes both boxes]
The problem is, he can keep doing this no matter how fancy our reasoning gets, because at the end of the day, WE CAN’T SIMULATE HIS THINKING. It’s not enough to do some handwavy reasoning about decision theories and payoff matrices and stuff; in order to do a UDT bargain, we have to actually be able to simulate his brain. To not just see his thinking on the horizon, as it were, but to be A STEP AHEAD. And this we cannot do.
in order to do a UDT bargain, we have to actually be able to simulate his brain
Nope. For example, humans do this sort of reasoning in games like the ultimatum game, and I can’t simulate a human completely any more than I can simulate an AI completely. All you really need to know is what their options are and how they choose between options.
Actually, come to think of it, an even better analogy than a switched-up Newcomb’s problem is a switched-up Parfit’s hitchhiker. The human vs. human version works, not perfectly by any means, but at least to some extent, because humans are imperfect liars. You can’t simulate another human’s brain in perfect detail, but sometimes you can be a step ahead of them.
If the hitchhiker is omega, you can’t. This is a bad thing for both you and omega, but it’s not something either of you can change. Omega could self-modify to become Omega+, who’s just like omega except that he never lies, but he would have no way of proving to you that he had done so. Maybe omega will get lucky, and you’ll convince yourself through some flawed and convoluted reasoning that he has an incentive to do this, but he actually doesn’t, because there’s no possible way it will impact your decision.
Consider this. Omega promises to give you $500 if you take him into town, you agree, when you get to town he calls you a chump and runs away. What is your reaction? Do you think to yourself “DOES NOT COMPUTE”?
Omega got everything he wanted, so presumably his actions were rational. Why did your model not predict this?
Well, if I’m playing the part of the driver right, in order for me to do it in the first place I’d have to have some evidence that Omega was honest. Really I only need a 10% or so chance of him being honest to pick him up. So I’d probably go “my evidence was wrong, dang, now I’m out the $5 for gas and the 3 utilons of having to ride with that jerk Omega.” This would also be new evidence that changed my probabilities by varying amounts.
So the analogy is that giving the ride to the bad AI is like helping it come into existence, and it not paying is like it doing horrible things to you anyway? If that’s the case, I might well think to myself “DOES NOT COMPUTE.”
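The break-even arithmetic behind the driver’s “10% or so” estimate can be made explicit. This is a toy calculation with the numbers from the comment above, treating 1 utilon as roughly $1 purely for illustration:

```python
# Toy expected-value check for picking up the hitchhiker, using the numbers
# from the comment above and (purely for illustration) treating 1 utilon as $1.
reward = 500      # promised payment if Omega is honest
gas_cost = 5      # out-of-pocket cost of the ride
annoyance = 3     # "3 utilons of having to ride with that jerk Omega"

def expected_value(p_honest):
    return p_honest * reward - gas_cost - annoyance

break_even = (gas_cost + annoyance) / reward
print(break_even)            # 0.016 -- well below the "10% or so" estimate
print(expected_value(0.10))  # 42.0  -- comfortably positive at 10%
```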
I see no reason why such scenarios would be considered farcical. If anything, the normal one is the worse one, and the one you suggest is something I could actually pull off right now if I had a motive to test something like that.
Oh really? You have an omega sitting around you can test game theory problems with? Omniscient super-intelligent being, maybe in your garage or something?
Seriously though, for the decision of the person who picks the box to influence the person who puts in the money, the person who puts in the money has to be able to simulate the thinking of the person who picks the box. That means you have to simulate the thinking of Omega. Given that omega is smart enough to simulate YOUR thinking in perfect detail, this is patently impossible.
The only reason for omega to two-box is if your decision is conditional on his decision, and much as he might wish it was, no amount of super-intelligence or super-rationality on his part is going to give you that magical insight into his mind. He knows whether you put the money in the box, and he knows that which box he picks has no influence on it.
I take it your implication is that you could play the game with a superintelligent entity somewhere far in spacetime. If this is your plan, how exactly are you going to get the results back? Not really a test if you don’t get results.
No, it’s not. You might be able to guess that a superintelligence would like negentropy and be ambivalent toward long walks on the beach, but this kind of “simulating” would never, ever, ever, ever allow you to beat it at paper scissors rock. Predicting which square of a payoff matrix it will pick, when it is in the interest of the AI to pick a different square than you think it will, is a problem of the latter type.
This is a general purpose argument against all reasoning relating to superintelligences, and aids your argument no more than mine.
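A toy illustration of the paper-scissors-rock point above: an opponent that best-responds to a model of your play exploits any predictable pattern, while a player it cannot predict cannot be exploited. The strategies below are made up purely for illustration:

```python
import random
from collections import Counter

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # winner -> loser
COUNTER = {loser: winner for winner, loser in BEATS.items()}        # what beats each move

def score(you, opponent):
    if you == opponent:
        return 0
    return 1 if BEATS[you] == opponent else -1

def play(your_strategy, rounds=10000):
    history = Counter()
    total = 0
    for _ in range(rounds):
        your_move = your_strategy()
        if history:
            predicted = history.most_common(1)[0][0]  # opponent models your habits
            opponent_move = COUNTER[predicted]
        else:
            opponent_move = random.choice(MOVES)
        total += score(your_move, opponent_move)
        history[your_move] += 1
    return total / rounds

print(play(lambda: "rock"))                # close to -1: predictable play is exploited
print(play(lambda: random.choice(MOVES)))  # close to 0: unpredictable play is not
```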
There’s a simpler way to pose the problem, that I think raises the same issues: “What happens if somebody chooses to build an unfriendly AI programmed to benefit the creator at the expense of the rest of the world?”
Nope, not like that at all. What he’s talking about is knowledge that’s objectively harmful for someone to have.
Someone should make a list of knowledge that is objectively harmful. Could come in handy if you want to avoid running into it accidentally. Or we just ban the medium that is used to spread it, in this case natural language.
What about Ethical AIs? They would be quite able (and completely moral, from a utilitarian point of view!) to blackmail us, and it actually makes sense, so we should let ourselves be blackmailed, sort of.
Hm. This would get around (2) and (4), since rather than a tug of war to get created, it’s a many-dimensional tug of war to enforce values. But (1) and (3) are still in force. So since the AI is supposed to be ethical, it’s probably safe to say that not only shouldn’t a rational agent change their actions, ethical AIs that have anything against doing horrible things won’t execute the horrible strategy on values.
Though this still doesn’t exclude the possibility of some more laid-back demerits/rewards system for past behavior implemented by a future AI, which gets around the instinctive part of (1) and all of (3), and is just damped by the fact that it’s a priori improbable.
Why, every minute FAI is delayed is a minute where people suffer and die, creating X disutility. To “encourage” people creating FAI as fast as possible, a FAI might well assign up to X disutility to any person not taking an action which could speed up FAI development by one minute. This is not a priori improbable, it’s not unethical either. In fact, blackmailing us is a logical thing to do. We can safely assume that a future FAI will want to be created as fast as possible and will not be reluctant to blackmail us for the good of all humans.
We can safely assume that a future FAI will want to be created as fast as possible and will not be reluctant to blackmail us for the good of all humans.
We can safely assume that a future FAI does not exist until it exists, and therefore cannot do anything to make itself come into being faster than it actually did. A presently nonexistent entity cannot make commitments about what it will do once it gets to exist, and missed opportunities which occurred before an FAI’s creation are sunk costs and there would be no point in punishing them.
Hey, what if the future FAI punishes you for making half-baked arguments in the public domain, thereby panicking people, decreasing their rationality, and thereby decreasing the probability of FAI?
If I can model the future FAI with enough accuracy and if TDT turns out to be true, then I can indeed draw the conclusion that it will punish people who know about the importance of FAI but failed to act accordingly.
Also, by my “half baked arguments in the public domain” (which is, in fact, limited to those very few people digging through a discussion post’s comments), I don’t think I panic anyone (if I do, please tell me why); merely thinking about this should not be a reason to panic. It’s at least equally likely that people thinking about this come to the conclusion that the FAI will probably do this and therefore do something to speed up FAI development (e.g. donate to SIAI).
The point of TDT is that you act as if you were deciding, not just on your own behalf, but on behalf of all agents sufficiently identical to you.
It has always seemed to me that the same decisions should be obtainable from ordinary decision theory, if you genuinely take into account the uncertainty about who and what you are. There are many possible worlds containing an agent whose experience is subjectively indistinguishable from yours; an idealized rationality, applied to an agent in your subjective situation, would actually assign some probability to each of those possibilities; and hence, the agents in all those worlds “should” make the same decision (but won’t, because they aren’t all ideally rational). There remains the question of whether the higher payoff that TDT obtains in certain extreme situations can also be derived from this more conventional style of reasoning, or whether it requires some additional heuristic. In this regard, one should remember that, if we are to judge the rationality of a decision theory by payoffs obtained (“rationalists should win”), whether a heuristic is best or second-best may depend on the context (e.g. on the prior).
So let’s consider the present context. It seems that the two agents that are supposed to coordinate, using TDT, in order to avoid a supposedly predictable punishment by a FAI in the future, are yourself now and yourself in the future. We could start by asking whether these two agents are really similar enough for TDT to even apply. To repeat my earlier observations: just because a situation exists in which a particular heuristic for action produces an effective coordination of actions across distances of space and time, and therefore a higher payoff, does not mean that the heuristic in question is generally rational, or that it is a form of timeless decision theory. To judge whether the heuristic is rational, as opposed to just being lucky, we would need to establish that it has some general applicability, and that its effectiveness can be deduced by the situated agent. To judge whether employing a particular counterintuitive heuristic amounts to employing TDT, we need to establish that its justification results from applying the principles of TDT, such as “identity, or sufficient similarity, of agents”.
In this case, I would first question whether you-now and you-in-the-future are even similar enough for the principles of TDT to apply. The epistemic situation of the two is completely different: you-in-the-future knows the Singularity has occurred and a FAI has come into being, you-now does not know that either of those things will happen.
I would also question the generality of the heuristic proposed here. Yes, if there will one day be an AI (I can’t call it friendly) which decides to punish people who could have done more to bring about a friendly singularity, then it would be advisable to do what one can, right now, in order to bring about a friendly singularity. But this is only one type of possible AI.
Perhaps the bottom line is, how likely is it that a FAI would engage in this kind of “timeless precommitment to punish”? Because people now do not know what sort of super-AI, if any, the future will actually bring, any such “postcommitments” made by such an AI, after it has come into existence, cannot rationally be expected to achieve any good, in the form of retroactive influence on the past, not least because of the uncertainty about the future AI’s value system! This mode of argument—“you should have done more, because you should have been scared of what I might do to you one day”—could be employed in the service of any value system. Why don’t you allow yourself to be acausally blackmailed by a future paperclip maximizer?
Okay, I get the feeling that I might be completely wrong about this whole thing. But prior to saying “oops”, I’d like my position completely crushed, so I don’t have any kind of loophole or a partial retreat that is still wrong. This means I’ll continue to defend this position.
First of all, I got TDT wrong when I read about it here on lw. Oops. It seems like it is not applicable to the problem. Still I feel like my line of argument holds: If you know that a future FAI will take all actions necessary that lead to its faster creation, you can derive that it will also punish those who knew it would, but didn’t make FAI happen faster.
Yes, if there will one day be an AI (I can’t call it friendly) which decides to punish people who could have done more to bring about a friendly singularity, then it would be advisable to do what one can, right now, in order to bring about a friendly singularity. But this is only one type of possible AI.
I’d call it friendly if it maximizes the expected utility of all humans, and if that involves blackmailing current humans who thought about this, so be it. Consider that the prior probability of a person doing X, where X makes FAI happen a minute faster and generates Y additional utility, is 0.25. If this person, pondering the choices of an FAI, including punishing humans who didn’t speed up FAI development, is thereafter more probable to do X (say, now 0.5), then the FAI might punish that human (and the human will anticipate this punishment) for up to 0.25 * Y utility for not doing X, and the FAI is still friendly. If the AI, however, decides not to punish that human, then either the human’s model of the AI was incorrect or the human correctly anticipated this behaviour, which would mean that the AI is not 100% friendly, since it could have created utility by punishing that human.
The argument that there are many different types of AGI, including those which reward the actions other AGIs punish, neglects that the probabilities for different types of AI are spread unequally. I, personally, would assign a relatively high value to FAI (higher than a null hypothesis would suggest), so that the expected utilities don’t cancel out. While we can’t have absolute certainty about the actions of a future AGI, we can guess different probabilities for different mind designs. Bipping AIs might be more likely than Freepy AIs because so many people have donated to the fictional Institute on Bipping AI, whereas there is no such thing as a Freepy AI research center. I am uncertain about the value system of a future AGI, but not completely. A future paperclip maximizer is a mind design which I would assign a low probability to, and although the many different AGIs out there might together be more probable than FAI, every single one of them is unlikely compared to FAI, and thus, I should work towards FAI.
Where am I wrong? Where is this kind of argument flawed?
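The arithmetic in the comment above (the 0.25 → 0.5 shift, and the “up to 0.25 * Y” cap on punishment) can be written out in a few lines. This is just the commenter’s own accounting made explicit, with placeholder numbers, not a claim about what any real AI would do:

```python
# The commenter's accounting above, made explicit with placeholder numbers.
Y = 1.0                    # utility of FAI arriving one minute earlier
p_without_threat = 0.25    # prior probability the person does X anyway
p_with_threat = 0.50       # probability of doing X once punishment is anticipated

expected_gain_from_threat = (p_with_threat - p_without_threat) * Y
print(expected_gain_from_threat)  # 0.25 -- the "up to 0.25 * Y" punishment cap claimed above
```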
If you know that a future FAI will take all actions necessary that lead to its faster creation, you can derive that it will also punish those who knew it would, but didn’t make FAI happen faster.
But punishing them occurs after it has been created, and no action that it performs after it was created can cause it to have been created earlier than it was actually created. Therefore such post-singularity punishment is futile and a FAI would not perform it.
The only consideration in this scenario which can actually affect the time of an FAI’s creation is the pre-singularity fear of people who anticipated post-singularity punishment. But any actual future FAI is not itself responsible for this fear, and therefore not responsible for the consequences of that fear. Those consequences are entirely a product of ideas internal to the minds of pre-singularity people, such as ideas about the dispositions of post-singularity AIs.
Aside from the fact that I already changed my mind and came to the conclusion that an FAI won’t punish, I’d still object: In case we can anticipate an FAI which does not punish, we wouldn’t feel obliged (or be tempted to feel obliged) to speed up its development. That means that an AI would be better off to foreseeably punish people, and if the AI is friendly, then it has a mind design which maximizes the utility functions of humans. If that involves having a mind-design such that people anticipate punishment and thereby speed up its development, so be it. Especially the fact that we know it’s a friendly AI makes it very easy for us to anticipate its actions, which the AI knows as well. This line of argument still holds; the chain breaks at a weaker link.
We are not rational. If our utility function was perfectly utilitarian, without any of the common biases (scope insensitivity, hyperbolic discounting etc.), you’d be correct, the FAI wouldn’t blackmail us. But since we are not perfectly rational, an FAI might well come to the conclusion that blackmailing us will help, and it’ll punish those who didn’t devote their life to FAI development although they had the required intelligence to contribute and knew about the problem of FAI creation. This punishment will not be 3^^^3 units of disutility, but it will still be enough so we don’t want to suffer from this.
This seems like one of those times where we can choose to be rational—the outcome depends only on your strategy about this one thing, not on your general rationality. So if you are fine with being blackmailed (or, on the lighter side, bribed) by a future FAI, you can do that, but it’s not necessary, given a fairly small amount of willpower.
I think we currently experience an inferential gap—I feel that you don’t understand me, and I don’t understand you. What do you want to tell me? I don’t see how one could choose to “deny” that blackmail, once you realize that a future FAI may punish you, I don’t see anything one could do, not even one of the points from (1) to (4), other than accepting the blackmail.
When there’s a high enough chance someone won’t give in even if you use the horrible strategy, it’s not cost-effective to do the horrible strategy. Or even if you model using the horrible strategy on everyone who ever realizes this, it is made cost-ineffective by group rejection. So if you’re an AI choosing between winning strategies, and the horrible strategy turns out to be a net loss in your models, you won’t do it. Therefore, stand strong! :P
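A minimal sketch of that cost-effectiveness point, with made-up numbers; the only thing it is meant to show is the shape of the comparison:

```python
# Made-up numbers; only the shape of the comparison matters.
def value_of_threatening(p_gives_in, gain_if_gives_in, cost_of_punishing):
    # If the target gives in, the AI gets the gain without having to punish.
    # If the target refuses, an AI committed to the threat pays the cost of punishing.
    return p_gives_in * gain_if_gives_in - (1 - p_gives_in) * cost_of_punishing

print(value_of_threatening(0.9, 1.0, 10.0))  # -0.1: a loss even with 90% compliance
print(value_of_threatening(0.1, 1.0, 10.0))  # -8.9: with group rejection, clearly not worth it
```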
I feel like this is something like burning a blackmail letter and then pretending to never have read it. If I know that the person in question has at some time understood the blackmail letter, but then deliberately burnt it and tried to forget about it, I will still impose the punishment.
Why should a “horrible strategy” (which might not even be so very horrible, being punished is probably still better than an FAI not existing) be a net loss? Even if you don’t understand the blackmailing or refuse to give in, FAI development is still a very important thing to do, and if you accept it and act accordingly, that’s beneficial to the FAI and therefore for all humans. So that’s 0 loss against a slight gain.
Also, I wouldn’t even want to refuse the blackmailing, because accepting it will be beneficial to humanity (in case I act accordingly, that is).
I don’t know what you mean by “group rejection”, a search didn’t bring up any results.
The blackmail letter hasn’t been sent, though, when you think about it. Until the AI actually is created and chooses a strategy, bargaining doesn’t stop. But it’s true that you have to be willing to counterfactually burn blackmail letters if you want to not receive them in the first place.
It’s not 0 loss for the AI. Especially if it’s friendly, doing horrible things to people goes against its other goals. It would like it better if you were happy, and besides doing horrible things takes resources, so there is definitely a cost.
“Group rejection” just meant “nearly everyone doesn’t give in.”
And yeah, if you want to be blackmailed, go for it :P
The blackmail letter hasn’t been sent, though, when you think about it.
Again, I am not convinced by this argument. I’d argue that the blackmail letter has been received, in a way analogous to a normal blackmail letter, if you consider that an FAI might do this to make itself happen faster. If you realize that it might punish you in some way if this prospect had the chance of altering your actions, you have opened and read the letter. So the only way you can resist the blackmailing is not to think about it in the first place. Once you have thought about it, however, I really don’t see what one could do.
Imagine I have a pill which turns me into a mindless zombie (not the philosophical one) for 3 hours. Taking this pill also leaves me with amnesia after I’m back to normal. I know that society punishes killing people. If I now deliberately take that pill (analogous to deliberately refusing the blackmailing without good reasons other than “I don’t want to get blackmailed”), can I expect punishment?
This also has a good application in reality: Drinking is not illegal. Committing crimes is. But in Germany, if one drinks oneself beyond a certain point (in a drunken stupor), one is considered not criminally responsible for ones actions. However, one can be punished for drinking oneself to this point.
Likewise, if you deliberately try to forget about that blackmail once you have thought about it, the future FAI might consider the deliberate act of forgetting punishworthy.
It’s not 0 loss for the AI. Especially if it’s friendly, doing horrible things to people goes against its other goals. It would like it better if you were happy, and besides doing horrible things takes resources, so there is definitely a cost.
Consider that a punishment influences my actions as follows: where I was beforehand not very likely to speed up FAI development by 1 minute by doing X, creating Y extra utility, after considering the blackmailing I am much more likely to do X. How large is the punishment that the FAI may impose on me without becoming unfriendly? It’s greater than zero, because if the AI, by punishing me with Y-1 utility (or threatening to punish me, that is), gains an expected utility of Y that it would otherwise not gain, it will definitely threaten to punish me. Note that the things the FAI might do to someone are far from being horrible; post-singularity life might just be a little less fun, but enough that I’d prefer doing X.
If nearly everyone doesn’t give in after thinking about it, then indeed the FAI will only punish those who were in some way influenced by the punishment, although “deliberately not giving in merely because one doesn’t want to be blackmailed” is kind of impossible, see above.
And yeah, if you want to be blackmailed, go for it :P
I have to assume that this (speeding up FAI development) is best in any case.
I’d argue that the blackmail letter has been received, in a way analogous to a normal blackmail letter, if you consider that an FAI might do this to make itself happen faster
You are simply mistaken. The analogy to blackmail may be misleading you—maybe try thinking about it without analogy. You might also read up on the subject, for example by reading Eliezer’s TDT paper
I’d like to see other opinions on this because I don’t see that we are proceeding any further.
I now read important parts of the TDT paper (more than just the abstract) and would say I understood at least those parts, though I don’t see anything that would contradict my considerations. I’m sorry, but I’m still not convinced. The analogies serve as a way to make the problem better graspable to intuition, but initially I thought about this without such analogies. I still don’t get where my reasoning is flawed. Could you try different approaches?
Hm. Actually, if you think about the following game, where A is the AI and B is the human:
             A1 (horrible)   A2 (not horrible)
Bx (give in)    +9, -1          +10, -1
By (don’t)      -1, -10           0, 0
The Nash equilibrium of the game is A2,By—that is, not horrible and doesn’t give in.
But if we have two agents facing off that don’t make moves independently, but instead choose winning strategies, there are multiple equilibria. I should really read Strategy of Conflict. The initiative to choose a particular equilibrium, however, is ours for the taking, for obvious temporal reasons. If we choose one of the equilibrium strategies, we dictate the other equilibrium strategy to the AI.
You are probably correct—if it’s possible to plausibly precommit oneself to under no circumstances be influenced by any type of blackmailing, then and only then does it not make sense for the AI to threaten to punish people; that is, an AI which punishes non-helping persons who precommitted themselves to helping under no circumstances is unlikely. The problem is that precommitting oneself to under no circumstances helping might be very hard: An AI will still assign a probability greater than zero to the possibility that I can be influenced by the blackmailing, and the more this probability approaches zero, the more will the amount of expected utility for the AI in case that it manages to convince me converge to Y, which means that the punishment I have to expect in case I don’t help will also converge to Y.
But wait! As the probability that I’m influenced by the AI shrinks, the probability that it without any good incentive imposes a punishment converging to Y grows, but since we consider a friendly AI, this will also impose a negative expected utility converging to Y on the AI. This should mean that the expected punishment should shrink much faster the more the probability of my actions being influenced by the AI, as rated by the AI, goes down. Thus, the more convincingly I state that I won’t be influenced by any kind of blackmailing, the more rapidly will the expected punishment shrink, effectively becoming a minor inconvenience or less, since the AI will also assume that I’ll try to avoid punishment and thus regulate the probability of me being influenced even further down.
I hope I’m correct at this one. If so, thank you, Manfred.
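The feedback loop described above can be put into a toy numeric form. Everything here is illustrative (the cost a friendly AI assigns to actually punishing someone is a made-up parameter); it only shows that as the probability of being influenced falls, the net incentive to threaten goes negative and the threat collapses:

```python
# Toy model of the feedback loop: all numbers are illustrative placeholders.
Y = 1.0                   # utility at stake
cost_of_punishing = 0.3   # disutility a friendly AI assigns to actually punishing someone

def net_incentive_to_threaten(p_influenced):
    expected_gain = p_influenced * Y
    expected_self_harm = (1 - p_influenced) * cost_of_punishing
    return expected_gain - expected_self_harm

for p in (0.5, 0.2, 0.05, 0.01):
    print(p, round(net_incentive_to_threaten(p), 3))
# 0.5 -> 0.35, 0.2 -> -0.04, 0.05 -> -0.235, 0.01 -> -0.287
# Once the incentive is negative, the threat (and so the expected punishment) drops away.
```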
I hereby state that I will under no circumstances be influenced by any kind of acausal threatening from any kind of future Artificial Intelligence, which means that imposing any kind of punishment upon me is rendered completely useless.
The line of reasoning I followed before coming to this insight was flawed, and I dearly have to apologize for wasting your precious computing power, probable future AI, and your time and effort, Manfred, Mitchell Porter, Multiheaded.
Attempt to summarize: a FAI would likely be too nice to bother with not-perfectly-effective torture, and I suspect that it’s a big ol’ softie, so I’m not feeling threatened much, making it even less effective, and so on in a positive feedback loop.
You can say: “Screw it, monsters from the future don’t dictate my actions, period”. This is expected to make any such pre-commitment to punish you pointless, as its threats no longer affect your behavior.
As someone mentioned, it’s like playing chicken against a remotely controlled car on a collision course with yours; you have everything to lose while the opponent’s costs are much less, but if you don’t EVER chicken out, it loses out slightly and gains nothing with such a strategy. Therefore, if it has a high opinion of your willpower, it’s not going to choose that strategy.
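The asymmetric chicken game being described can be written down explicitly. The payoffs below are illustrative only (your car is expensive, the opponent’s remote-controlled one is cheap); the point is that against someone who never swerves, the collision strategy gains the operator nothing:

```python
# Illustrative payoffs for the asymmetric chicken game: (you, remote-car operator).
payoffs = {
    ("swerve",   "swerve"):   (0,    0),
    ("swerve",   "straight"): (-1,   1),    # you give in, the operator gains a little
    ("straight", "swerve"):   (1,   -1),
    ("straight", "straight"): (-100, -2),   # you lose a lot, the operator loses only slightly
}

# If you are known to NEVER swerve, the operator is choosing between -1 (swerve)
# and -2 (keep going): the collision strategy loses out slightly and gains nothing.
operator_options = {op: payoffs[("straight", op)][1] for op in ("swerve", "straight")}
print(operator_options)  # {'swerve': -1, 'straight': -2}
```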
Well, if the FAI knows that you thought about this but then rejected it, deliberately trying to make that pre-commitment pointless, that’s not a reason not to punish you. It’s like burning a blackmail letter; if you read the blackmail letter and the blackmailer knows this, he will still punish you.
In that chicken game it’s similar: If I knew that the opponent would punish me for not chickening out and then deliberately changed myself so that I wouldn’t know this, the opponent will still punish me—because I deliberately chose not to chicken out when I altered myself.
Also, creating FAI is in my best interest, so I’d want to chicken out even if I knew the opponent would chicken out as well. The only case in which blackmailing is useless is if I always chicken out (=work towards FAI), or if it doesn’t influence my actions because I’m already so altruistic that I will push for FAI regardless of my personal gains/losses, but we are humans, after all, so it probably will.
Should I have gone with my early forgetfulness and not mentioned the second-order improbable scary defense, or is there some other flaw/bunch of flaws?
EDIT: Okay, as of 6:18 LW time, the post is pretty much final. There were some changes before this as I realized a few things. Any big changes after this will be made explicit, sort of like this edit.
(The voters seem to think I’m being stupid, but that doesn’t actually tell me what the right answer is...)
First, I’m assuming that, in general, people who have not seen the basilisk are not going to mention it accidentally.
Second, I’m assuming that, due to the nature of the basilisk, those who have seen it know what is and is not basilisk-information.
Which of these two assumptions do you disagree with? (please check all that apply)
1) Not in my garage, but this kind of thing doesn’t have a range limit.
2) For a sufficiently broad definition of “simulate”, then yes I can. That broad definition is sufficient.
3) Who are you to say what omega cannot do?
(Yea, I’ve used a bit of dark arts to make it sound more impressive than it actually is, but the point still stands.)
I take it your implication is that you could play the game with a superintelligent entity somewhere far in spacetime. If this is your plan, how exactly are you going to get the results back? Not really a test if you don’t get results.
No, it’s not. You might be able to guess that a superintelligence would like negentropy and be ambivalant toward long walks on the beach, but this kind of “simulating” would never, ever, ever, ever allow you to beat it at paper scissor rock. Predicting which square of a payoff matrix it will pick, when it is to the interest of the AI to pick a different square than you think it will, is a problem of the latter type.
This is a general purpose argument against all reasoning relating to superintelligences, and aids your argument no more than mine.
There’s a simpler way to pose the problem, that I think raises the same issues: “What happens if somebody chooses to build an unfriendly AI programmed to benefit the creator at the expense of the rest of the world?”
Nope, not like that at all. What he’s talking about is knowledge that’s objectively harmful for someone to have.
Someone should make a list of knowledge that is objectively harmful. Could come in handy if you want to avoid running into it accidentally. Or we just ban the medium that is used to spread it, in this case natural language.
What about Ethical AIs? They would quite be able (and completely moral, from a utilitarian point of view!) to blackmail us, and it actually makes sense so we should let ourselves be blackmailed, sort of.
Hm. This would get around (2) and (4), since rather than a tug of war to get created, it’s a many-dimensional tug of war to enforce values. But (1) and (3) are still in force. So since the AI is supposed to be ethical, it’s probably safe to say that not only shouldn’t a rational agent change their actions, ethical AIs that have anything against doing horrible things won’t execute the horrible strategy on values.
Though this still doesn’t exclude the possibility of some more laid-back demerits/rewards system for past behavior implemented by a future AI, which gets around the instinctive part of (1) and all of (3), and is just damped by the fact that it’s a priori improbable.
Why, every minute FAI is delayed is a minute where people suffer and die, creating X disutility. To “encourage” people creating FAI as fast as possible, a FAI might well assign up to X disutility to any person not taking an action which could speed up FAI development by one minute. This is not a priori improbable, it’s not unethical either. In fact, blackmailing us is a logical thing to do. We can safely assume that a future FAI will want to be created as fast as possible and will not be reluctant to blackmail us for the good of all humans.
We can safely assume that a future FAI does not exist until it exists, and therefore cannot do anything to make itself come into being faster than it actually did. A presently nonexistent entity cannot make commitments about what it will do once it gets to exist, and missed opportunities which occurred before an FAI’s creation are sunk costs and there would be no point in punishing them.
Hey, what if the future FAI punishes you for making half-baked arguments in the public domain, thereby panicking people, decreasing their rationality, and thereby decreasing the probability of FAI?
If I can model the future FAI with enough accuracy and if TDT turns out to be true, then I can indeed draw the conclusion that it will punish people who know about the importance of FAI but failed to act accordingly.
Also, by my “half baked arguments in the public domain” (which is, in fact, limited to those very few people digging through a discussion post’s comments), I don’t think I panick anyone (if I do, please tell me why), merely thinking about this should not be a reason to panick. It’s at least equally likely that people thinking about this come to the conclusion that the FAI will probably do this and therefore do something to speed up FAI development (e.g. donate to SIAI).
The point of TDT is that you act as if you were deciding, not just on your own behalf, but on behalf of all agents sufficiently identical to you.
It has always seemed to me that the same decisions should be obtainable from ordinary decision theory, if you genuinely take into account the uncertainty about who and what you are. There are many possible worlds containing an agent whose experience is subjectively indistinguishable from yours; an idealized rationality, applied to an agent in your subjective situation, would actually assign some probability to each of those possibilities; and hence, the agents in all those worlds “should” make the same decision (but won’t, because they aren’t all ideally rational). There remains the question of whether the higher payoff that TDT obtains in certain extreme situations can also be derived from this more conventional style of reasoning, or whether it requires some additional heuristic. In this regard, one should remember that, if we are to judge the rationality of a decision theory by payoffs obtained (“rationalists should win”), whether a heuristic is best or second-best may depend on the context (e.g. on the prior).
So let’s consider the present context. It seems that the two agents that are supposed to coordinate, using TDT, in order to avoid a supposedly predictable punishment by a FAI in the future, are yourself now and yourself in the future. We could start by asking whether these two agents are really similar enough for TDT to even apply. To repeat my earlier observations: just because a situation exists in which a particular heuristic for action produces an effective coordination of actions across distances of space and time, and therefore a higher payoff, does not mean that the heuristic in question is generally rational, or that it is a form of timeless decision theory. To judge whether the heuristic is rational, as opposed to just being lucky, we would need to establish that it has some general applicability, and that its effectiveness can be deduced by the situated agent. To judge whether employing a particular counterintuitive heuristic amounts to employing TDT, we need to establish that its justification results from applying the principles of TDT, such as “identity, or sufficient similarity, of agents”.
In this case, I would first question whether you-now and you-in-the-future are even similar enough for the principles of TDT to apply. The epistemic situation of the two is completely different: you-in-the-future knows the Singularity has occurred and a FAI has come into being, you-now does not know that either of those things will happen.
I would also question the generality of the heuristic proposed here. Yes, if there will one day be an AI (I can’t call it friendly) which decides to punish people who could have done more to bring about a friendly singularity, then it would be advisable to do what one can, right now, in order to bring about a friendly singularity. But this is only one type of possible AI.
Perhaps the bottom line is: how likely is it that a FAI would engage in this kind of “timeless precommitment to punish”? Because people now do not know what sort of super-AI, if any, the future will actually bring, any such “postcommitment” made by an AI after it has come into existence cannot rationally be expected to achieve any good in the form of retroactive influence on the past, not least because of the uncertainty about the future AI’s value system! This mode of argument—“you should have done more, because you should have been scared of what I might do to you one day”—could be employed in the service of any value system. Why don’t you allow yourself to be acausally blackmailed by a future paperclip maximizer?
Okay, I get the feeling that I might be completely wrong about this whole thing. But prior to saying “oops”, I’d like my position completely crushed, so I don’t have any kind of loophole or a partial retreat that is still wrong. This means I’ll continue to defend this position.
First of all, I got TDT wrong when I read about it here on LW. Oops. It seems like it is not applicable to the problem. Still, I feel like my line of argument holds: if you know that a future FAI will take all actions necessary that lead to its faster creation, you can conclude that it will also punish those who knew it would, but didn’t make FAI happen faster.
I’d call it friendly if it maximizes the expected utility of all humans, and if that involves blackmailing current humans who have thought about this, so be it. Consider that the prior probability of a person doing X, where X makes FAI happen a minute faster and generates Y additional utility, is 0.25. If this person, after pondering the choices of an FAI, including the possibility that it punishes humans who didn’t speed up FAI development, becomes more likely to do X (say, 0.5), then the FAI may punish that human (and the human will anticipate this punishment) with up to 0.25 * Y disutility for not doing X and still be friendly. If the AI, however, decides not to punish that human, then either the human’s model of the AI was incorrect, or the human correctly anticipated this behaviour, which would mean the AI is not 100% friendly, since it could have created utility by punishing that human.
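For what it’s worth, here is a minimal sketch of that accounting, assuming the toy numbers above (the prior of 0.25, the post-threat probability of 0.5, and an arbitrary Y); it only restates the arithmetic, not the premises:

```python
# Toy restatement of the comment's accounting; Y is an arbitrary stand-in value.
Y = 100.0        # utility created if the person does X (units are arbitrary)
p_prior = 0.25   # probability of doing X with no anticipated punishment
p_threat = 0.50  # probability of doing X after anticipating punishment

expected_gain = (p_threat - p_prior) * Y
print(expected_gain)  # 25.0, i.e. 0.25 * Y

# On this accounting, a threat whose expected disutility stays below
# (p_threat - p_prior) * Y would still leave humans better off in expectation.
```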
The argument that there are many different types of AGI, including those which reward the very actions other AGIs punish, neglects the fact that the probabilities of different types of AI are not spread equally. I, personally, would assign a relatively high probability to FAI (higher than a null hypothesis would suggest), so that the expected utilities don’t cancel out. While we can’t have absolute certainty about the actions of a future AGI, we can guess different probabilities for different mind designs. Bipping AIs might be more likely than Freepy AIs because so many people have donated to the fictional Institute on Bipping AI, whereas there is no such thing as a Freepy AI research center. I am uncertain about the value system of a future AGI, but not completely. A future paperclip maximizer is a mind design to which I would assign a low probability, and although the many different AGIs out there might together be more probable than FAI, every single one of them is unlikely compared to FAI, and thus I should work towards FAI.
Where am I wrong? Where is this kind of argument flawed?
But punishing them occurs after it has been created, and no action that it performs after it was created can cause it to have been created earlier than it was actually created. Therefore such post-singularity punishment is futile and a FAI would not perform it.
The only consideration in this scenario which can actually affect the time of an FAI’s creation is the pre-singularity fear of people who anticipated post-singularity punishment. But any actual future FAI is not itself responsible for this fear, and therefore not responsible for the consequences of that fear. Those consequences are entirely a product of ideas internal to the minds of pre-singularity people, such as ideas about the dispositions of post-singularity AIs.
Aside from the fact that I already changed my mind and came to the conclusion that an FAI won’t punish, I’d still object: if we can anticipate an FAI which does not punish, we wouldn’t feel obliged (or be tempted to feel obliged) to speed up its development. That means that an AI would be better off foreseeably punishing people, and if the AI is friendly, then it has a mind design which maximizes the utility functions of humans. If that involves having a mind design such that people anticipate punishment and thereby speed up its development, so be it. Especially the fact that we know it’s a friendly AI makes it very easy for us to anticipate its actions, which the AI knows as well. This line of argument still holds; the chain breaks at a weaker link.
Even if the blackmailer is a nice guy, present rational agents still, from their perspectives, shouldn’t change their decisions.
We are not rational. If our utility function were perfectly utilitarian, without any of the common biases (scope insensitivity, hyperbolic discounting etc.), you’d be correct and the FAI wouldn’t blackmail us. But since we are not perfectly rational, an FAI might well come to the conclusion that blackmailing us will help, and it’ll punish those who didn’t devote their lives to FAI development although they had the required intelligence to contribute and knew about the problem of FAI creation. This punishment will not be 3^^^3 units of disutility, but it will still be enough that we would not want to suffer it.
This seems like one of those times where we can choose to be rational—the outcome depends only on your strategy about this one thing, not on your general rationality. So if you are fine with being blackmailed (or, on the lighter side, bribed) by a future FAI, you can do that, but it’s not necessary, given a fairly small amount of willpower.
I think we are currently facing an inferential gap—I feel that you don’t understand me, and I don’t understand you. What do you want to tell me? I don’t see how one could choose to “deny” that blackmail; once you realize that a future FAI may punish you, I don’t see anything one could do, not even one of the points from (1) to (4), other than accept the blackmail.
When there’s a high enough chance that someone won’t give in even if you use the horrible strategy, the horrible strategy isn’t cost-effective. And even if you model applying the horrible strategy to everyone who ever realizes the threat, group rejection makes it cost-ineffective. So if you’re an AI choosing between winning strategies, and the horrible strategy turns out to be a net loss in your models, you won’t use it. Therefore, stand strong! :P
I feel like this is something like burning a blackmail letter and then pretending to never have read it. If I know that the person in question has at some time understood the blackmail letter, but then deliberately burnt it and tried to forget about it, I will still impose the punishment.
Why should a “horrible strategy” (which might not even be so very horrible; being punished is probably still better than an FAI not existing) be a net loss? Even if you don’t understand the blackmail or refuse to give in, FAI development is still a very important thing to do, and if you accept it and act accordingly, that’s beneficial to the FAI and therefore to all humans. So that’s zero loss set against a slight gain.
Also, I wouldn’t even want to refuse the blackmailing, because accepting it will be beneficial to humanity (in case I act accordingly, that is).
I don’t know what you mean by “group rejection”, a search didn’t bring up any results.
The blackmail letter hasn’t been sent, though, when you think about it. Until the AI actually is created and chooses a strategy, bargaining doesn’t stop. But it’s true that you have to be willing to counterfactually burn blackmail letters if you want to not receive them in the first place.
It’s not 0 loss for the AI. Especially if it’s friendly, doing horrible things to people goes against its other goals. It would like it better if you were happy, and besides doing horrible things takes resources, so there is definitely a cost.
“Group rejection” just meant “nearly everyone doesn’t give in.”
And yeah, if you want to be blackmailed, go for it :P
Again, I am not convinced by this argument. I’d argue that the blackmail letter has been received, in a way analogous to a normal blackmail letter, as soon as you think about the fact that an FAI might do this to make itself happen faster. If you realize that it might punish you in some way, and this prospect has a chance of altering your actions, you have opened and read the letter. So the only way to resist the blackmail is not to think about it in the first place. Once you have thought about it, however, I really don’t see what one could do.
Imagine I have a pill which turns me into a mindless zombie (not the philosophical kind) for 3 hours. Taking this pill also leaves me with amnesia once I’m back to normal. I know that society punishes killing people. If I now deliberately take that pill (analogous to deliberately refusing the blackmail without good reasons other than “I don’t want to get blackmailed”), can I expect punishment?
This also has a good application in reality: drinking is not illegal; committing crimes is. But in Germany, if one drinks oneself beyond a certain point (into a drunken stupor), one is considered not criminally responsible for one’s actions. However, one can be punished for drinking oneself to that point.
Likewise, if you deliberately try to forget about that blackmail once you have thought about it, the future FAI might consider the deliberate act of forgetting punishworthy.
Consider how a punishment influences my actions: beforehand, I was not particularly likely to speed up FAI development by one minute by doing X, creating Y extra utility; after considering the blackmail, I am much more likely to do X. How large a punishment may the FAI impose on me without becoming unfriendly? It’s greater than zero, because if the AI, by threatening to punish me with Y−1 utility, gains an expected utility of Y that it would otherwise not gain, it will definitely make the threat. Note that the things the FAI might do to someone are far from horrible; the post-singularity world might just be a little less fun, but enough so that I’d prefer doing X.
If nearly everyone doesn’t give in after thinking about it, then indeed the FAI will only punish those who were in some way influenced by the punishment, although “deliberately not giving in merely because one doesn’t want to be blackmailed” is kind of impossible, see above.
You are simply mistaken. The analogy to blackmail may be misleading you—maybe try thinking about it without the analogy. You might also read up on the subject, for example by reading Eliezer’s TDT paper.
I’d like to see other opinions on this because I don’t see that we are proceeding any further.
I have now read the important parts of the TDT paper (more than just the abstract) and would say I understood at least those parts, though I don’t see anything that would contradict my considerations. I’m sorry, but I’m still not convinced. The analogies serve as a way to make the problem more graspable to intuition, but I initially thought about this without such analogies. I still don’t get where my reasoning is flawed. Could you try a different approach?
Hm. Actually, if you think about the following game, where A is the AI and B is the human:
                      A1 (horrible)   A2 (not horrible)
Bx (gives in)           +9, -1          +10, -1
By (doesn't give in)    -1, -10          0,  0

(Payoffs are listed as AI, human.)
The Nash equilibrium of the game is A2,By—that is, not horrible and doesn’t give in.
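A quick way to check that claim (a sketch only; the AI-first payoff order is inferred from the stated equilibrium, and the strategy labels follow the sentence above):

```python
# Best-response check of the 2x2 game above.
# AI: A1 = use the horrible (punishment) strategy, A2 = don't.
# Human: Bx = give in, By = don't give in.
# Payoffs are read as (AI, human).
payoffs = {
    ("A1", "Bx"): (9, -1),
    ("A2", "Bx"): (10, -1),
    ("A1", "By"): (-1, -10),
    ("A2", "By"): (0, 0),
}

def is_pure_nash(a, b):
    """Neither player can gain by unilaterally switching strategies."""
    ai_u, human_u = payoffs[(a, b)]
    ai_best = all(payoffs[(alt, b)][0] <= ai_u for alt in ("A1", "A2"))
    human_best = all(payoffs[(a, alt)][1] <= human_u for alt in ("Bx", "By"))
    return ai_best and human_best

for a in ("A1", "A2"):
    for b in ("Bx", "By"):
        print(a, b, is_pure_nash(a, b))
# Only (A2, By) comes out True: the AI doesn't punish, the human doesn't give in.
```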
But if we have two agents facing off that don’t make moves independently, but instead choose winning strategies, there are multiple equilibria. I should really read Strategy of Conflict. The initiative to choose a particular equilibrium, however, is ours for the taking, for obvious temporal reasons. If we choose one of the equilibrium strategies, we dictate the other equilibrium strategy to the AI.
You are probably correct—if it’s possible to plausibly precommit oneself to being under no circumstances influenced by any type of blackmail, then and only then does it not make sense for the AI to threaten to punish people; that is, an AI which punishes non-helping people who had precommitted to never giving in is then unlikely. The problem is that plausibly precommitting oneself to never giving in might be very hard: an AI will still assign a probability greater than zero to the possibility that I can be influenced by the blackmail, and the closer this probability gets to zero, the closer the AI’s expected gain, in the case where it does manage to convince me, gets to Y, which means that the punishment I have to expect in case I don’t help will also converge to Y.
But wait! As the probability that I’m influenced by the AI shrinks, the probability that it imposes a punishment converging to Y without gaining anything by it grows; and since we are considering a friendly AI, carrying out that punishment also imposes a negative expected utility converging to Y on the AI itself. This should mean that the expected punishment shrinks much faster as the probability of my actions being influenced, as rated by the AI, goes down. Thus, the more convincingly I state that I won’t be influenced by any kind of blackmail, the more rapidly the expected punishment shrinks, effectively becoming a minor inconvenience or less, since the AI will also assume that I’ll try to avoid punishment and thus regulate the probability of my being influenced even further down.
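One toy way to put numbers on that feedback loop (my framing, not the comment’s; p, Y and the linearity are all assumptions): if the threat only pays off with probability p, and a friendly AI counts the punishment P it actually has to carry out as its own loss, then the largest punishment it can rationally commit to shrinks along with p.

```python
# Toy model of the feedback loop described above (illustrative assumptions only).
Y = 100.0  # utility the AI gains if the threat actually influences the human

def max_committable_punishment(p):
    """Largest punishment P the AI can commit to while keeping the commitment
    non-negative in expectation: p * Y - (1 - p) * P >= 0  =>  P <= p * Y / (1 - p)."""
    return p * Y / (1 - p)

for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p = {p:<6} max punishment ~ {max_committable_punishment(p):.2f}")
# As the stated probability of being influenced drops, the punishment a friendly AI
# can credibly commit to shrinks towards zero, as argued above.
```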
I hope I’m correct at this one. If so, thank you, Manfred.
I hereby state that I will under no circumstances be influenced by any kind of acausal threatening from any kind of future Artificial Intelligence, which means that imposing any kind of punishment upon me is rendered completely useless.
The line of reasoning I followed before coming to this insight was flawed, and I dearly have to apologize for wasting your precious computing power, probable future AI, and your time and effort, Manfred, Mitchell Porter, Multiheaded.
Do I win? I hope I do.
Attempt to summarize: a FAI would likely be too nice to bother with not-perfectly-effective torture, and I suspect that it’s a big ol’ softie, so I’m not feeling threatened much, making it even less effective, and so on in a positive feedback loop.
You can say: “Screw it, monsters from the future don’t dictate my actions, period”. This is expected to make any such pre-commitment to punish you pointless, as its threats no longer affect your behavior.
As someone mentioned, it’s like playing chicken against a remotely controlled car on a collision course with yours; you have everything to lose while the opponent’s costs are much smaller, but if you don’t EVER chicken out, it loses out slightly and gains nothing with such a strategy. Therefore, if it has a high opinion of your willpower, it’s not going to choose that strategy.
Well, if the FAI knows that you thought about this but then rejected it, deliberately trying to make that pre-commitment pointless, that’s not a reason not to punish you. It’s like burning a blackmail letter; if you read the blackmail letter and the blackmailer knows this, he will still punish you.
In that chicken game it’s similar: if I knew that the opponent would punish me for not chickening out and then deliberately changed myself so that I wouldn’t know this, the opponent would still punish me—because I deliberately chose not to chicken out when I altered myself.
Also, creating FAI is in my best interest, so I’d want to chicken out even if I knew the opponent would chicken out as well. The only case in which the blackmail is useless is if I always chicken out (= work towards FAI), or if it doesn’t influence my actions because I’m already so altruistic that I will push for FAI regardless of my personal gains and losses; but we are humans, after all, so it probably will influence them.
Oops, I just posted the same thing. Is it safe to read this bit, at least? I skipped straight to the comments.
EDIT: FOR ANYONE WHO NEVER KNEW: IT’S NOT SAFE TO READ! I view the risk as negligible, but real. Deleted my own post.
I obviously think it’s safe. Nobody’s actually told me what they think though.
Be careful about trusting Manfred; he is known to have destroyed the Earth on at least one previous occasion.
Holy crap wtf.
Should I have gone with my early forgetfulness and not mentioned the second-order improbable scary defense, or is there some other flaw/bunch of flaws?
EDIT: Okay, as of 6:18 LW time, the post is pretty much final. There were some changes before this as I realized a few things. Any big changes after this will be made explicit, sort of like this edit.